Problem Set 4 - Simple Linear Regression
Instructions
Complete the following exercises based on ModernDive Chapter 5 - Simple Linear Regression. Before beginning, create a new R Markdown document and give it a YAML header that includes the title “HPAM 7660 Problem Set 4”, your name, the date, and “pdf_document” as the output format.
As you answer each of the following questions, be sure to include your R code and associated output in your R Markdown document. Additionally, add a line or two describing what you’re doing in each code chunk.
Steps for Completing the Assignment
1. Install and load the moderndive package. This package contains the datasets and helper functions we’ll be using throughout this problem set. You’ll also want to load the dplyr package.
2. We’ll start by exploring the un_member_states_2024 dataset, which contains data on 181 UN member states and includes variables for life expectancy (life_expectancy_2022), fertility rate (fertility_rate_2022), and obesity rate (obesity_rate_2024). Use the glimpse() function to preview the dataset and then use tidy_summary() to generate summary statistics for all variables. How would you describe the typical life expectancy and fertility rate across countries in the data?
3. Now let’s fit a simple linear regression model with fertility_rate_2022 as the outcome and life_expectancy_2022 as the explanatory variable. Use the lm() function to fit the model and the coef() function to display the estimated coefficients. Write out the regression equation using the estimated values of the intercept and slope.
4. Interpret the slope coefficient from your regression. In practical terms, what does a one-year increase in life expectancy imply for a country’s fertility rate? Be sure to use the word “associated” in your answer and explain why we use that language rather than saying life expectancy causes changes in fertility.
5. Use the get_regression_points() function on your model from Step 3 to generate a data frame of observed values, fitted values, and residuals. Does the United States have a higher or lower fertility rate than the model predicts? What does that mean in practical terms?
6. Now let’s shift to examining a categorical explanatory variable. Fit a linear regression model with life_expectancy_2022 as the outcome and continent as the explanatory variable. Display the coefficients. Which continent serves as the baseline for comparison, and how do you know? What is the model’s predicted life expectancy for countries in Europe?
7. Interpret the coefficient on continentAsia. What does this value tell you about life expectancy in Asian countries relative to the baseline group? Is Asia above or below the baseline, and by how much?
8. Use get_regression_points() with the ID = "country" argument to retrieve the fitted values and residuals for the continent model. Identify the five countries with the most negative residuals. What do large negative residuals tell us about these countries relative to others in their continent?
9. Section 5.3.1 of ModernDive discusses the important distinction between correlation and causation. In your own words, what is a confounding variable, and why does its presence make it difficult to draw causal conclusions from a simple regression? Use the life expectancy and fertility rate example from the chapter to illustrate your answer.
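If you’re unsure how the function calls in the steps above fit together, the following is a rough sketch of one possible workflow, not a model answer. The object names (ps4_data, fert_model, fert_points, cont_model, cont_points, most_negative) are made up for illustration. Dropping rows with missing values before fitting is an assumption the assignment does not ask for, and the summary is shown for only the two numeric variables of interest, whereas the assignment asks for all variables. The interpretation questions are still yours to answer.

```r
# install.packages("moderndive")  # run once if the package is not yet installed
library(moderndive)
library(dplyr)

# Preview the full dataset
glimpse(un_member_states_2024)

# Assumption (not part of the assignment): keep only the columns used below
# and drop rows with missing values so every model sees the same countries
ps4_data <- un_member_states_2024 %>%
  select(country, continent, life_expectancy_2022, fertility_rate_2022) %>%
  na.omit()

# Summary statistics, shown here for the two numeric variables of interest
ps4_data %>%
  select(life_expectancy_2022, fertility_rate_2022) %>%
  tidy_summary()

# Simple linear regression with a numeric explanatory variable
fert_model <- lm(fertility_rate_2022 ~ life_expectancy_2022, data = ps4_data)
coef(fert_model)

# Observed values, fitted values, and residuals, labeled by country
fert_points <- get_regression_points(fert_model, ID = "country")
fert_points %>%
  filter(grepl("United States", country))

# Simple linear regression with a categorical explanatory variable
cont_model <- lm(life_expectancy_2022 ~ continent, data = ps4_data)
coef(cont_model)

# Five countries with the most negative residuals in the continent model
cont_points <- get_regression_points(cont_model, ID = "country")
most_negative <- cont_points %>%
  arrange(residual) %>%
  head(5)
most_negative
```

In an R Markdown document, each of these pieces would normally go in its own code chunk, with a sentence or two above it describing what the chunk does.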
Once you’ve finished the final step, knit your document to PDF, upload it to the Problem Set 4 assignment link on Canvas, and you’re done!